Fix audit findings (non-docs): spec-conformance, bugs, security, refactor#112
Conversation
Protocol/spec conformance: - Lease-expiry termination now emits final_status "error" with LEASE_EXPIRED (#74) - Enforce max_runtime_sec via a watchdog emitting TIMEOUT/"timed_out" (#89) - Release idempotency claim on failed accept so identical retries are not poisoned with DUPLICATE_KEY (#90) - Canonical idempotency fingerprint: sort JsonNode properties (#91) - Subscriber events use the subscriber session's own event_seq space (#76) - Invalid/undecodable job.submit now returns INVALID_REQUEST (#78) - Idempotent replay returns the budget captured at acceptance (#79) - add session.close/session.closed (alias session.bye) ack (#96) - add job.cancelled ack; cancel of unknown job -> JOB_NOT_FOUND, no silent drops (#95) - Unknown credential scheme decodes to UNKNOWN instead of throwing (#97) - list_jobs maps only the requested page (#83) - Lease.contains compares cost.budget numerically; add LeaseSubset validator (#77) Security: - credential_rotated is no longer fanned out to subscribers (#75) - subscribe history replay re-envelopes events with the subscriber's session id and seq, and redacts credential_rotated (#94) - rotateCredential tracks/revokes surplus minted credentials and fails loudly on empty issue rather than aliasing a revoked handle (#98) Bugs: - runJob catch blocks guard terminal emission to avoid duplicate job.error (#92) - terminal job.result/job.error fan out to subscriber sessions (#93) - bound per-job event history and evict terminal jobs after the resume window; unlink closed sessions from subscriber lists (#101) - top-level errors echo the originating request id; client correlates them so a list/subscribe error never fails an unrelated submit (#102) - handshake uses CAS and volatile heartbeat handle to avoid resurrecting a closed session or leaking a heartbeat task (#103) - start worker before scheduling watchdog; cancelled() covers all terminal states (#104) - client completes subscribe publishers when a job terminates (#105) - blocking submit is bounded by a timeout and rejects dispatch-thread reentrancy; add submitAsync (#106) - ArcpRuntime.accept inserts the session before start() to avoid zombie entries (#107) - cost.budget.* gauges are not treated as spend; guard null metric value (#108) - ResultStream rejects divergent duplicate pending chunks, tolerates identical (#111) Adds protocol/unit tests covering the above (#33). Co-authored-by: Cursor <cursoragent@cursor.com>
- Jetty: enforce allowedHosts with a servlet filter returning 403 before the WebSocket upgrade, instead of a builder field that was never read (#99) - Spring Boot: enforce arcp.middleware.allowed-hosts with a HandshakeInterceptor rejecting disallowed Hosts with 403 (#99) - Jakarta: record the host decision per handshake and close the session with VIOLATED_POLICY before ArcpRuntime.accept, instead of merely clearing response headers (which the container ignored) (#100) - Vert.x: observe the async write result and fail the transport on write error so the runtime detects a dead session instead of silently dropping events (#110) Co-authored-by: Cursor <cursoragent@cursor.com>
An unexpected transport drop now parks the session for the resume window instead of cancelling its in-flight jobs: job ownership, event history, and the resume buffer are retained and registered by resume token. A session.hello carrying a valid resume_token (and matching principal) reattaches the new transport to the existing session identity, replays buffered envelopes with event_seq greater than last_event_seq, and continues delivering live events; an unknown or expired token returns RESUME_WINDOW_EXPIRED. Explicit session.close still cancels in-flight jobs. Parked sessions are torn down when the resume window elapses or the runtime closes. Adds resume + explicit-close tests. Co-authored-by: Cursor <cursoragent@cursor.com>
…ng (#80, #82) Begin splitting the oversized SessionLoop by moving two cohesive responsibilities into their own tested types: IdempotencyFingerprint (canonical 7.2 hashing) and JobListing (6.6 filter/sort/paging). SessionLoop no longer contains the list-jobs implementation directly. Partial: the full job-execution/submit/ subscribe extraction and the under-800-line target remain follow-up work, deferred to avoid regression risk to the behavioral fixes in this PR. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Warning Review limit reached
More reviews will be available in 49 minutes and 29 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR implements session resume resilience, credential scheme flexibility, host allowlist security, and idempotent job submission. The changes span message protocol extensions, client-side blocking behavior with timeout, lease delegation validation, bounded event history, idempotency fingerprinting, session parking across unexpected transport drops, and host validation across multiple middleware frameworks. ChangesSession Resume, Resilience, and Credential Security
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
CI's spotless:check gate (JDK 21) flagged formatting drift in the audit-fix commits, which were verified locally with -Darcp.skip.spotless=true on JDK 25. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The resume semantics introduced for #22 broke the resume example and the stream-resume recipe in two ways: - ArcpClient dropped a top-level RESUME_WINDOW_EXPIRED error that arrives before session.welcome as uncorrelated, leaving connect() to hang until its timeout. The handshake is now rejected with the decoded error so callers can fall back to a fresh session (and the in-memory transport never completes the client's inbound stream on a runtime-side close, so the existing transport-closed fallback could not catch this). - examples/resume and recipes/stream-resume closed the first client gracefully, which cancels the session (§6.7) and invalidates its resume token. They now authenticate with a stable bearer principal, drop the transport without session.close so the runtime parks the session for the resume window, and await the parked phase before reconnecting. Adds a ClientAuditTest regression test asserting connect() surfaces RESUME_WINDOW_EXPIRED promptly for an unknown token. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Integer.parseInt on a valid-Base64 but non-numeric cursor threw a raw NumberFormatException whose parser message was echoed to the client in INVALID_REQUEST. Both decode failures now surface as a uniform "invalid cursor" IllegalArgumentException. Addresses the code-quality review finding on decodeCursor. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| case JobSubscribe sub -> handleSubscribe(sub); | ||
| case JobSubscribe sub -> handleSubscribe(envelope, sub); | ||
| case JobUnsubscribe unsub -> handleUnsubscribe(unsub); | ||
| case SessionClosed ignored -> log.warn("client-only message received: {}", m); |
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment Thanks for integrating Codecov - We've got you covered ☂️ |
javadoc:javadoc alone does not compile, so arcp-runtime's javadoc resolved arcp-core from the runner's cached SNAPSHOT — which lacks classes a PR adds upstream (here JobCancelled/SessionClosed from #95, #96). Run package in the same invocation so inter-module dependencies come from the reactor. Verified against a cold local repository. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…oc, 80% coverage gate (#114) * fix(runtime): make credential issuance non-blocking on the dispatch thread (#109) acceptJob ran the provisioner's CompletableFuture to completion with join() inside onNext, so a slow upstream key-minting call starved every inbound message on the session: pings went unanswered, heartbeat.onInbound never advanced, and a stall beyond 2x heartbeat_interval_sec had the session reaped as HEARTBEAT_LOST mid-submit. One slow job.submit also head-of-line-blocked cancels and acks for every other job. Acceptance is now split: the prologue (resolve, register) stays on the dispatch thread; the epilogue (attach credentials, job.accepted, worker start, watchdogs) runs when issuance completes, chained through a per-session acceptSequence so job.accepted preserves submit order — the FIFO correlation clients rely on, which is why this was deferred in #112. Idempotent replays join the same chain so a replayed acceptance cannot overtake the original and always observes the captured budget (#79). If the session dies or the job is cancelled while issuance is in flight, freshly minted credentials are revoked and nothing is announced; the lease-expiry watchdog measures its delay from the current clock so issuance latency cannot postpone an absolute expires_at. AsyncAcceptanceTest covers both acceptance criteria: pongs continue through a >2x-interval provisioner stall, and a slow first issuance does not let a later acceptance overtake it. Closes #109 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: correct spec-section citations to existing sections Closes #81 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: complete Javadoc for exported APIs and enforce doclint Document every public/protected type, constructor, method, record component, and enum constant across the ten published modules (arcp-core, arcp-client, arcp-runtime, arcp, arcp-otel, arcp-runtime-jetty, arcp-middleware-jakarta, arcp-middleware-spring-boot, arcp-middleware-vertx, arcp-tck): ~575 doclint findings fixed, including two broken {@link} references. Enforce it in the root pom: maven-javadoc-plugin now runs with doclint `all` (including `missing`) and failOnWarnings=true, so the CI javadoc job fails on any undocumented exported API. The gate is overridable for local triage via -Darcp.javadoc.failOnWarnings=false. A few doclint "use of default constructor" warnings required adding explicit, documented no-arg constructors with semantics identical to the implicit ones; no other code changed. Closes #34 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(middleware): cover host-allowlist, handshake, and transport-failure branches (#33) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(runtime): raise branch coverage for session/job/credential paths (#33) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(client,core): raise branch coverage for client correlation, transport, and core wire paths (#33) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(otel): cover tracing-transport branches; enforce 80% line+branch coverage (#33) arcp-otel was the last module under the bar (66.7% branch): the new ArcpOtelBranchTest covers send-failure span recording, optional envelope attributes, trace-context inject (no-op propagator, merge into existing extensions, non-object extensions left untouched) and extract (malformed shapes fall back to Context.current, valid traceparent becomes the receive span's parent), publisher caching, close ordering, and upstream completion propagation. With every library module now >=80% on both counters (line 97.4%, branch 94.7% overall), jacoco:check enforces a 0.80 line+branch BUNDLE minimum at verify time. Library modules opt in via arcp.skip.coverage.check=false; examples/recipes and the sourceless arcp umbrella stay exempt, and check self-skips when tests were skipped (no execution data), so -DskipTests CI jobs are unaffected. Closes #33 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Summary
Fixes the non-docs audit findings across
arcp-core,arcp-runtime,arcp-client, and the middleware adapters: spec-conformance gaps, security leaks, resource/lifecycle bugs, a high-severity resume bug, and a partial refactor ofSessionLoop. All modules build andmvn testis green (run with-Darcp.skip.spotless=truelocally on JDK 25, per the pom note; CI runs Spotless on JDK 21).Spec-conformance
final_status: "error"withLEASE_EXPIRED(terminal statusERROR).max_runtime_secis accepted but never enforced;TIMEOUTis never produced (§7.1 / §12) #89 Enforcemax_runtime_secvia a watchdog that emitsTIMEOUT/final_status: "timed_out"; both watchdogs are cancelled when the job ends.DUPLICATE_KEY.inputchanges the SHA-256 (§7.2) #91 Idempotency fingerprint sortsObjectNodeproperties, so key-order-only differences hash identically.event_seqspace.job.submit(e.g. non-UTCexpires_at) now returnsINVALID_REQUESTinstead of being silently dropped.job.result/job.error(§7.6) #93 Terminaljob.result/job.errorfan out to subscriber sessions (with their own seq, no credentials).job.cancelis never acknowledged: nojob.cancelledmessage exists, and invalid cancels are silently dropped (§7.4) #95 Addsjob.cancelledack; cancel of an unknown/invisible job returnsJOB_NOT_FOUND, missingjob_idreturnsINVALID_REQUEST— no silent drops.session.byeand never acknowledges; spec requiressession.close/session.closed(§6.7) #96 Addssession.close/session.closed(withsession.byeaccepted as a deprecated alias) and acknowledges before teardown.schemethrows during decode, dropping the wholejob.acceptedand hanging the submitter (§9.8.1) #97 Unknown credentialschemedecodes toUNKNOWNinstead of throwing; the client ignores unrecognized-scheme credentials, so a single extension credential can no longer deadlock the submit path.list_jobsonly materializesJobSummaryfor the requested page.Lease.containscomparescost.budgetnumerically; adds aLeaseSubsetvalidator for delegation subset enforcement (capabilities, numeric budget, model.use, expiry) throwingLEASE_SUBSET_VIOLATION. (Partial — see Notes.)cost.budget.remaininggauges are treated as spend and decrement the budget (§9.6) #108cost.budget.*gauges are no longer treated as spend; null metric values are guarded.Security
credential_rotatedis delivered only to the submitting session, never fanned out to subscribers.job.subscribehistory replay forwards raw producer-session envelopes — wrongsession_id, foreignevent_seq, and replayedcredential_rotatedsecrets (§7.6 / §8.3 / §14) #94 Subscribe-history replay re-envelopes events with the subscriber'ssession_idand freshly allocatedevent_seq, and never replays credential material to a non-owning subscriber.rotateCredentialleaks extra provisioned credentials and can revoke the rotated credential's own provider handle (§9.8.2 / §14) #98rotateCredentialrecords/revokes surplus minted credentials and fails loudly on an empty issue rather than aliasing (and then revoking) a live handle.allowedHostsis dead configuration in ArcpJettyServer and the Spring Boot middleware #99allowedHostsis enforced (Jetty servlet filter + SpringHandshakeInterceptor) returning 403 before the upgrade, instead of being a silently ignored knob.VIOLATED_POLICYbeforeArcpRuntime.accept, instead of merely clearing response headers (which the container ignores).Bugs / lifecycle
resume_tokenreattaches the new transport to the same identity, replays missed events, and continues live delivery. Explicitsession.closestill cancels. Unknown/expired tokens returnRESUME_WINDOW_EXPIRED.runJobcatch blocks emit a second terminaljob.errorafter cancel/expiry already terminated the job #92runJobcatch blocks guard terminal emission so a job can never emit two terminal messages.JobRecord.eventHistoryis unlimited and terminal jobs are never evicted #101 Per-job event history is bounded (resume-buffer capacity), terminal jobs are evicted after the resume window, and closed sessions are unlinked from subscriber lists.job.errorarrives for list_jobs/subscribe failures #102 Top-level errors echo the originating request id; the client correlates them, so alist_jobs/subscribeerror never fails an unrelated in-flight submit andlistJobssurfaces its ownINVALID_REQUESTinstead of timing out.doHandshakecan resurrect a CLOSED session (phase.set(ACTIVE)) and leak a never-cancelled heartbeat task #103 Handshake usescompareAndSetand a volatile heartbeat handle, so a closed session cannot be resurrected and no orphaned heartbeat task leaks.setWorker, losing the interrupt;cancelled()ignores TIMED_OUT so the agent never stops #104 The worker is started before the watchdog is scheduled, andcancelled()covers all terminal states, so a polling agent exits after lease-expiry termination.ArcpClientnever completessubscribe()publishers when the job reaches a terminal state #105 The client completes (onComplete/onError) livesubscribe()publishers when the job terminates and releases per-job executors.ArcpClient.submitblocks indefinitely withjoin()and deadlocks when called from a completion callback #106 Blockingsubmitis bounded by a configurable timeout and fails fast if called from a dispatch/result callback; addssubmitAsync.ArcpRuntime.acceptregisters the session afterstart(), racing transport completion into a zombie map entry #107ArcpRuntime.acceptregisters the session beforestart()to avoid a zombie map entry on synchronous transport completion.sendis fire-and-forget: write failures are silently lost, so the runtime never detects a dead session #110 The Vert.x transport observes the async write result and fails the transport on write error instead of silently dropping events.ResultStreamrejects divergent duplicate pending chunks (tolerating byte-identical retransmissions).Refactor / testing
IdempotencyFingerprintandJobListingfromSessionLoopinto cohesive, separately-tested types;SessionLoopno longer contains the list-jobs implementation. The full job-execution/submit/subscribe extraction and the <800-line target are deferred to keep this PR's behavioral changes low-risk.SessionLoopAuditTest,SessionResumeTest,ClientAuditTest,LeaseSubsetTest,IdempotencyFingerprintTest,JobListingTest, plus extensions to existing tests). No JaCoCo threshold change.Notes / partial
LeaseSubsetvalidator are implemented and tested, but end-to-end runtime auto-enforcement is not wired because this SDK uses client-driven manual delegation (the runtime does not derive child leases fromDelegateEvent, which carries no lease). The validator is ready for a future delegate-request flow.join()blocks the session dispatch thread, starving heartbeats and all other messages #109 (async credential issuance during accept) is intentionally not changed: making accept asynchronous would break the client's FIFOjob.acceptedcorrelation without a per-session sequencing layer; deferred to avoid that regression risk. The blockingissue().join()remains.Test plan
mvn -DskipTests -Darcp.skip.spotless=true installmvn -Darcp.skip.spotless=true test(all modules green; one pre-existing-style timing flake inSubscribeReplayTestis auto-reran green by the repo's surefire config)Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests